{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "Text_Normalization_Tutorial.ipynb",
      "private_outputs": true,
      "provenance": [],
      "collapsed_sections": [
        "lcvT3P2lQ_GS"
      ],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.8"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "id": "a5fA5qAm5Afg"
      },
      "source": [
        "if 'google.colab' in str(get_ipython()):\n",
        "  !pip install -q condacolab\n",
        "  import condacolab\n",
        "  condacolab.install()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x0DJqotopcyb",
        "collapsed": true
      },
      "source": [
        "\"\"\"\n",
        "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
        "\n",
        "Instructions for setting up Colab are as follows:\n",
        "1. Open a new Python 3 notebook.\n",
        "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
        "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
        "4. Run this cell to set up dependencies.\n",
        "\"\"\"\n",
        "# If you're using Google Colab and not running locally, run this cell\n",
        "\n",
        "# install NeMo\n",
        "BRANCH = 'main'\n",
        "if 'google.colab' in str(get_ipython()):\n",
        "  !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nYsp3SH24Tj_"
      },
      "source": [
        "if 'google.colab' in str(get_ipython()):\n",
        "  ! conda install -c conda-forge pynini=2.1.3\n",
        "  ! mkdir images\n",
        "  ! wget https://github.com/NVIDIA/NeMo/blob/$BRANCH/tutorials/text_processing/images/deployment.png -O images/deployment.png\n",
        "  ! wget https://github.com/NVIDIA/NeMo/blob/$BRANCH/tutorials/text_processing/images/pipeline.png -O images/pipeline.png"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CH7yR7cSwPKr"
      },
      "source": [
        "import os\n",
        "import wget\n",
        "import pynini\n",
        "import nemo_text_processing\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F-IrnmXMTevr"
      },
      "source": [
        "# Task Description\n",
        "\n",
        "Text normalization (TN) is a part of the Text-To-Speech (TTS) pre-processing pipeline. It could also be used for pre-processing Automatic Speech Recognition (ASR) training transcripts.\n",
        "\n",
        "TN is the task of converting text in written form to its spoken form to improve TTS. For example, `10:00` should be changed to `ten o'clock` and `10kg` to `ten kilograms`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xXRARM8XtK_g"
      },
      "source": [
        "# NeMo Text Normalization\n",
        "\n",
        "NeMo TN is based on weighted finite-state\n",
        "transducer (WFST) grammars. The tool uses [`Pynini`](https://github.com/kylebgorman/pynini) to construct WFSTs, and the created grammars can be exported and integrated into [`Sparrowhawk`](https://github.com/google/sparrowhawk) (an open-source version of [The Kestrel TTS text normalization system](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)) for production. The NeMo TN tool can be seen as a Python extension of `Sparrowhawk`. \n",
        "\n",
        "Currently, NeMo TN provides support for English and the following semiotic classes from the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish):\n",
        "DATE, CARDINAL, MEASURE, DECIMAL, ORDINAL, MONEY, TIME, TELEPHONE, ELECTRONIC, PLAIN. We additionally added the class `WHITELIST` for all whitelisted tokens whose verbalizations are directly looked up from a user-defined list.\n",
        "\n",
        "The toolkit is modular, easily extendable, and can be adapted to other languages and tasks like [inverse text normalization](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Inverse_Text_Normalization.ipynb). The Python environment enables an easy combination of text covering grammars with NNs. \n",
        "\n",
        "The rule-based system is divided into a classifier and a verbalizer following  [Google's Kestrel](https://www.researchgate.net/profile/Richard_Sproat/publication/277932107_The_Kestrel_TTS_text_normalization_system/links/57308b1108aeaae23f5cc8c4/The-Kestrel-TTS-text-normalization-system.pdf) design: the classifier is responsible for detecting and classifying semiotic classes in the underlying text, the verbalizer the verbalizes the detected text segment. \n",
        "In the example `The alarm goes off at 10:30 a.m.`, the tagger for TIME detects `10:30 a.m.` as a valid time data with `hour=10`, `minutes=30`, `suffix=a.m.`, the verbalizer then turns this into `ten thirty a m`.\n",
        "\n",
        "The overall NeMo TN pipeline from development in `Pynini` to deployment in `Sparrowhawk` is shown below (example for ITN):\n",
        "![alt text](images/deployment.png \"Inverse Text Normalization Pipeline\")\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-IT1Xr9iW2Xr"
      },
      "source": [
        "# Quick Start\n",
        "\n",
        "## Add TN to your Python TTS pre-processing workflow\n",
        "\n",
        "TN is a part of the `nemo_text_processing` package which is installed with `nemo_toolkit`. Installation instructions could be found [here](https://github.com/NVIDIA/NeMo/tree/main/README.rst)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Bfs7fa9lXDDh"
      },
      "source": [
        "from nemo_text_processing.text_normalization.normalize import Normalizer\n",
        "# creates normalizer object that works on lower cased input\n",
        "normalizer = Normalizer(input_case='cased', lang='en')\n",
        "raw_text = \"We paid $123 for this desk.\"\n",
        "normalizer.normalize(raw_text, verbose=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w5sX0SXbXoZp"
      },
      "source": [
        "In the above cell, `$123` would be converted to `one hundred twenty three dollars`, and the rest of the words remain the same.\n",
        "\n",
        "## Run Text Normalization on an input from a file\n",
        "\n",
        "Use `run_predict.py` to convert a written format from a file `INPUT_FILE` to a spoken text and save the output to `OUTPUT_FILE`. Under the hood, `run_predict.py` is calling `normalize()` (see the above section)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UD-OuFmEOX3T"
      },
      "source": [
        "# If you're running the notebook locally, update the NEMO_TEXT_PROCESSING_PATH below\n",
        "# In Colab, a few required scripts will be downloaded from NeMo github\n",
        "\n",
        "NEMO_TOOLS_PATH = '<UPDATE_PATH_TO_NeMo_root>/nemo_text_processing/text_normalization'\n",
        "DATA_DIR = 'data_dir'\n",
        "os.makedirs(DATA_DIR, exist_ok=True)\n",
        "\n",
        "if 'google.colab' in str(get_ipython()):\n",
        "    NEMO_TOOLS_PATH = '.'\n",
        "\n",
        "    required_files = ['run_predict.py',\n",
        "                      'run_evaluate.py']\n",
        "    for file in required_files:\n",
        "        if not os.path.exists(file):\n",
        "            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/nemo_text_processing/text_normalization/' + file\n",
        "            print(file_path)\n",
        "            wget.download(file_path)\n",
        "elif not os.path.exists(NEMO_TOOLS_PATH):\n",
        "      raise ValueError(f'update path to NeMo root directory')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "d4T0gXHwY3JZ"
      },
      "source": [
        "INPUT_FILE = f'{DATA_DIR}/test.txt'\n",
        "OUTPUT_FILE = f'{DATA_DIR}/test_tn.txt'\n",
        "\n",
        "! echo \"The alarm went off at 10:00.\" > $DATA_DIR/test.txt\n",
        "! cat $INPUT_FILE\n",
        "! python $NEMO_TOOLS_PATH/run_predict.py --input=$INPUT_FILE --output=$OUTPUT_FILE --language='en'"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "F5wSJTI8ZFRg"
      },
      "source": [
        "# check that the raw text was converted to the spoken form\n",
        "! cat $OUTPUT_FILE"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RMT5lkPYzZHK"
      },
      "source": [
        "## Run evaluation\n",
        "\n",
        "[Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish) consists of 1.1 billion words of English text from Wikipedia, divided across 100 files. The normalized text is obtained with The Kestrel TTS text normalization system).\n",
        "\n",
        "To run evaluation, the input file should follow the Google Text normalization dataset format. That is, every line of the file needs to have the format `<semiotic class>\\t<unnormalized text>\\t<self>` if it's trivial class or `<semiotic class>\\t<unnormalized text>\\t<normalized text>` in case of a semiotic class.\n",
        "\n",
        "\n",
        "Example evaluation run:\n",
        "\n",
        "\n",
        "`python run_evaluate.py \\\n",
        "        --input=./en_with_types/output-00001-of-00100 \\\n",
        "        [--language LANGUAGE] \\\n",
        "        [--input_case INPUT_CASE] \\\n",
        "        [--cat CATEGORY]`\n",
        "\n",
        "Use `--cat` to specify a `CATEGORY` to run evaluation on (all other categories are going to be excluded from evaluation). The option `--input_case` tells the algorithm that the input is either lower cased or cased.\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "u4zjeVVv-UXR"
      },
      "source": [
        "eval_text =  \"\"\"PLAIN\\ton\\t<self>\n",
        "DATE\\t22 july 2012\\tthe twenty second of july twenty twelve\n",
        "PLAIN\\tthey\\t<self>\n",
        "PLAIN\\tworked\\t<self>\n",
        "PLAIN\\tuntil\\t<self>\n",
        "TIME\\t12:00\\ttwelve o'clock\n",
        "<eos>\\t<eos>\n",
        "\"\"\"\n",
        "INPUT_FILE_EVAL = f\"{DATA_DIR}/test_eval.txt\"\n",
        "with open(INPUT_FILE_EVAL, 'w') as fp:\n",
        "  fp.write(eval_text)\n",
        "! cat $INPUT_FILE_EVAL"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "G7_5oXpObizP"
      },
      "source": [
        "! python $NEMO_TOOLS_PATH/run_evaluate.py --input=$INPUT_FILE_EVAL --language='en'"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bIvKBwRcH_9W"
      },
      "source": [
        "`run_evaluate.py` call will output both **sentence level** and **token level** accuracies. \n",
        "For our example, the expected output is the following:\n",
        "\n",
        "```\n",
        "Loading training data: data_dir/test_eval.txt\n",
        "Sentence level evaluation...\n",
        "- Data: 1 sentences\n",
        "100% 1/1 [00:00<00:00, 14.24it/s]\n",
        "- Normalized. Evaluating...\n",
        "- Accuracy: 1.0\n",
        "Token level evaluation...\n",
        "- Token type: PLAIN\n",
        "  - Data: 4 tokens\n",
        "100% 4/4 [00:00<00:00, 239.56it/s]\n",
        "  - Denormalized. Evaluating...\n",
        "  - Accuracy: 1.0\n",
        "- Token type: DATE\n",
        "  - Data: 1 tokens\n",
        "100% 1/1 [00:00<00:00, 33.69it/s]\n",
        "  - Denormalized. Evaluating...\n",
        "  - Accuracy: 1.0\n",
        "- Token type: TIME\n",
        "  - Data: 1 tokens\n",
        "100% 1/1 [00:00<00:00, 94.84it/s]\n",
        "  - Denormalized. Evaluating...\n",
        "  - Accuracy: 1.0\n",
        "- Accuracy: 1.0\n",
        " - Total: 6 \n",
        "\n",
        " - Total: 6 \n",
        "\n",
        "Class      | Num Tokens | Normalization\n",
        "sent level | 1          | 1.0  \n",
        "PLAIN      | 4          | 1.0  \n",
        "DATE       | 1          | 1.0  \n",
        "CARDINAL   | 0          | 0    \n",
        "LETTERS    | 0          | 0    \n",
        "VERBATIM   | 0          | 0    \n",
        "MEASURE    | 0          | 0    \n",
        "DECIMAL    | 0          | 0    \n",
        "ORDINAL    | 0          | 0    \n",
        "DIGIT      | 0          | 0    \n",
        "MONEY      | 0          | 0    \n",
        "TELEPHONE  | 0          | 0    \n",
        "ELECTRONIC | 0          | 0    \n",
        "FRACTION   | 0          | 0    \n",
        "TIME       | 1          | 1.0  \n",
        "ADDRESS    | 0          | 0 \n",
        "\n",
        "```\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L85ZaUJ_4TkF"
      },
      "source": [
        "# C++ deployment\n",
        "\n",
        "The instructions on how to export `Pynini` grammars and to run them with `Sparrowhawk`, could be found at [NeMo/tools/text_processing_deployment](https://github.com/NVIDIA/NeMo/tree/main/tools/text_processing_deployment)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ENMDNl9C4TkF"
      },
      "source": [
        "# WFST and Common Pynini Operations\n",
        "\n",
        "See [NeMo Text Inverse Normalization Tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Inverse_Text_Normalization.ipynb) for details."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lcvT3P2lQ_GS"
      },
      "source": [
        "# References and Further Reading:\n",
        "\n",
        "\n",
        "- [Zhang, Yang, Bakhturina, Evelina, Gorman, Kyle and Ginsburg, Boris. \"NeMo Inverse Text Normalization: From Development To Production.\" (2021)](https://arxiv.org/abs/2104.05055)\n",
        "- [Ebden, Peter, and Richard Sproat. \"The Kestrel TTS text normalization system.\" Natural Language Engineering 21.3 (2015): 333.](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)\n",
        "- [Gorman, Kyle. \"Pynini: A Python library for weighted finite-state grammar compilation.\" Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata. 2016.](https://www.aclweb.org/anthology/W16-2409.pdf)\n",
        "- [Mohri, Mehryar, Fernando Pereira, and Michael Riley. \"Weighted finite-state transducers in speech recognition.\" Computer Speech & Language 16.1 (2002): 69-88.](https://cs.nyu.edu/~mohri/postscript/csl01.pdf)"
      ]
    }
  ]
}